LanguageCrawl: A Generic Tool for Building Language Models Upon Common-Crawl
Abstract
The web contains an immense amount of data: hundreds of billions of words are waiting to be extracted and used for language research. In this work we introduce our tool LanguageCrawl, which allows Natural Language Processing (NLP) researchers to easily construct web-scale corpora from the Common Crawl Archive, a petabyte-scale open repository of web crawl information. Three use cases are presented: filtering Polish websites, building N-gram corpora, and training a continuous skip-gram language model with hierarchical softmax. Each of them has been implemented within the LanguageCrawl toolkit, with the possibility to adjust the specified language and N-gram ranks. Special effort has been put into high computing efficiency by applying highly concurrent multitasking. We make our tool publicly available to enrich NLP resources. We strongly believe that our work will help to facilitate NLP research, especially in under-resourced languages, where the lack of appropriately sized corpora is a serious hindrance to applying data-intensive methods such as deep neural networks.
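To make the N-gram use case concrete, the following is a minimal sketch of counting N-grams of an adjustable rank over lines of text. It is not LanguageCrawl's implementation: the function names `ngrams` and `count_ngrams` are hypothetical, tokenization is assumed to be lowercasing plus whitespace splitting, and the concurrency the abstract mentions is left out.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def ngrams(tokens: List[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every contiguous window of n tokens (hypothetical helper)."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def count_ngrams(lines: Iterable[str], n: int) -> Counter:
    """Count N-grams of rank n, line by line.

    Assumption: tokens are lowercased, whitespace-separated words.
    N-grams never span a line boundary.
    """
    counts: Counter = Counter()
    for line in lines:
        tokens = line.lower().split()
        counts.update(ngrams(tokens, n))
    return counts


if __name__ == "__main__":
    sample = ["a b a b", "a b"]
    for gram, freq in count_ngrams(sample, 2).most_common():
        print(" ".join(gram), freq)
```

Because the input is consumed as an iterable of lines, the same function could be applied to a stream of text extracted from crawl files without loading the whole corpus into memory.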
Similar resources
N-gram Counts and Language Models from the Common Crawl
We contribute 5-gram counts and language models trained on the Common Crawl corpus, a collection of over 9 billion web pages. This release improves upon the Google n-gram counts in two key ways: the inclusion of low-count entries and deduplication to reduce boilerplate. By preserving singletons, we were able to use Kneser-Ney smoothing to build large language models. This paper describes how the c...
Edinburgh's Phrase-based Machine Translation Systems for WMT-14
This paper describes the University of Edinburgh’s (UEDIN) phrase-based submissions to the translation and medical translation shared tasks of the 2014 Workshop on Statistical Machine Translation (WMT). We participated in all language pairs. We have improved upon our 2013 system by i) using generalized representations, specifically automatic word clusters for translations out of English, ii) us...
Exploring Rhetorical-Discursive Moves in Hassan Rouhani’s Inaugural Speech: A Eulogy for Moderation
Before a president practically begins his four-year term of office in Iran, a formal inaugural ceremony is held in the parliament. Being attended by national dignitaries and representatives from other countries, the inauguration of Iran's seventh president, Hasan Rouhani, was spectacular in several respects. The current study aimed at investigating the generic structure and rhetorical moves tha...
Conceptual Metaphoric Language Use in Structuring Political Discourse in Iran-West Relations: A CDA Perspective
The present study was carried out with the purpose of examining the role of metaphorical language in the critical discourse analysis (CDA) of political texts based on a modern framework postulated by Kövecses (2015). The corpus of the study consisted of thirty-thousand words chosen as a textual sample to see which source conceptual domains are used and what generic/discursive attributes emerge ...